Do not sanitize append vecs at startup#6552
Conversation
| /// This version of `new()` may only be called when reconstructing storages as part of startup. | ||
| /// It trusts the snapshot's value for `current_len`, and relies on later index generation or | ||
| /// accounts verification to ensure it is valid. | ||
| pub fn new_for_startup( |
There was a problem hiding this comment.
What's the use case for new_from_file compared with new_from_startup?
I understand the difference between the two, but when will we want to actually use new_from_file instead?
There was a problem hiding this comment.
Yeah, new_from_file() is now only called in tests, benches, and store-tool.
There was a problem hiding this comment.
Is there a reason it isn't marked DCOU?
There was a problem hiding this comment.
It probably can be. I wouldn't want to do that in this PR though.
There was a problem hiding this comment.
Why? It seems relevant to this PR: You are replacing the function with a new implementation and deprecating the old one for non-test usage.
There was a problem hiding this comment.
I'm not deprecating the old one though. It may currently be called only by dcou-stuff, but should we remove/deprecate/make-dcou it? I'm not sure; it seems like a normal/useful constructor, esp. since it is safe and can be called anywhere. I'd personally rather leave it for now. Maybe we revisit that? But that's why I don't want it within this PR.
There was a problem hiding this comment.
(drive by from bp pr) +1 deprecation or dcou
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6552 +/- ##
=======================================
Coverage 82.8% 82.8%
=======================================
Files 849 849
Lines 379166 379187 +21
=======================================
+ Hits 314064 314088 +24
+ Misses 65102 65099 -3 🚀 New features to boost your workflow:
|
| storage_access: StorageAccess, | ||
| ) -> Result<Arc<AccountStorageEntry>, SnapshotError> { | ||
| let (accounts_file, _num_accounts) = | ||
| AccountsFile::new_from_file(append_vec_path, current_len, storage_access)?; |
There was a problem hiding this comment.
so new_from_file is used for only tests now?
|
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
(cherry picked from commit 3804f39) # Conflicts: # runtime/src/serde_snapshot.rs
(cherry picked from commit 3804f39)
Problem
At startup from a snapshot, we must rebuild/reconstruct all the
AppendVecs. This is basically relinking all the storage files on disk to theAccountStorageEntrys that were serialized into the bank snapshot (aka snapshot manifest). We use the storage file paths from the snapshot manifest to reconstruct the AppendVecs.We also do some sanitization of the storage files when reconstructing the new AppendVecs. One step is to sanitize each account in the storage file. This involves scanning (i.e. reading) the whole file, and checking every single account to make sure the accounts are valid. Scanning every account in every storage file is rather time consuming. On mainnet-beta, we now have over 900 million accounts. Doing this sanitization step takes over 100 seconds.
An observation is that we also read all the accounts/storages two other times as part of startup. One is for index generation, and the other is for startup accounts verification. If the storages are wrong/invalid, then one (or both) of these other tasks will catch it. Thus, we can avoid the sanitization step during startup reconstruction.
Speeding up reconstructing the storages, speeds up startup. And if startup is faster, that means validators begin replaying faster. If validators begin replaying faster, they catch up faster, AND that also means there are fewer blocks they have to repair (vs getting through turbine).
Summary of Changes
Skip AppendVec accounts sanitization during startup when reconstructing storages.
Testing against mnb, I saw times drop from ~100-120 seconds to ~2 seconds for reconstructing storages at startup.